Linear Model Selection, Regularization and Moving Beyond Linearity

Module 09

Ray J. Hoobler

Libraries

Code
library(tidyverse)
library(broom) # For tidying model output
library(leaps) # For best subset selection 
library(readxl) # For reading Excel files
library(patchwork) # For combining plots
library(latex2exp) # For LaTeX expressions
library(rlang) # For working with formulas
library(glmnet) # For ridge regression and the lasso  

Linear Model Selection and Regularization

Subset Selection (1/3)

Best Subset Selection

Stepwise Selection

Choosing the Optimal Model

Subset Selection (2/3)

For a linear model with \(p\) predictors

\[ Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \ldots + \beta_pX_p + \epsilon \]

there are \(2^p\) possible models.


For example, with \(p=3\) predictors there are \(2^3 = 8\) possible models. They can be expressed as:

\[ \begin{align*} Y &= \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_3 + \epsilon \\ Y &= \beta_0 + \beta_1X_1 + \beta_2X_2 + \epsilon \\ Y &= \beta_0 + \beta_1X_1 + \beta_3X_3 + \epsilon \\ Y &= \beta_0 + \beta_2X_2 + \beta_3X_3 + \epsilon \\ Y &= \beta_0 + \beta_1X_1 + \epsilon \\ Y &= \beta_0 + \beta_2X_2 + \epsilon \\ Y &= \beta_0 + \beta_3X_3 + \epsilon \\ Y &= \beta_0 + \epsilon \\ \end{align*} \]
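The enumeration above can be verified by counting the subsets of each size; a quick base-R check:

```r
# One model for each subset of the p predictors: sum over subset sizes k = 0, ..., p.
p <- 3
n_models <- sum(choose(p, 0:p))  # 1 + 3 + 3 + 1
n_models == 2^p                  # TRUE
```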

Subset Selection (3/3)

What are we trying to achieve?

Prediction Accuracy

Constraining the number of predictors reduces variance and can lead to better prediction accuracy.

Model Interpretability

A model with fewer predictors is easier to interpret.

Best Subset Selection (1/9)

ISLR Algorithm 6.1

  1. Let \(M_0\) denote the null model, which contains no predictors.
  2. For \(k=1,2,\ldots,p\):
    1. Fit all \(\binom{p}{k}\) models that contain exactly \(k\) predictors.
    2. Pick the best among these models, and call it \(M_k\). Here best is defined as having the smallest RSS or, equivalently, the largest \(R^2\).
  3. Select a single best model from among \(M_0, M_1, \ldots, M_p\) using cross-validated prediction error, \(C_p\), BIC, or adjusted \(R^2\).

Note:

Step 2 is performed on a training set.
Step 3 is performed on a validation set.

\[ \binom{p}{k} = (\textrm{p choose k})= \frac{p!}{k!(p-k)!} \]
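Steps 1 and 2 of the algorithm can be sketched directly in base R. The example below runs the exhaustive search on the built-in mtcars data with three candidate predictors (chosen purely for illustration; this is not the concrete data or the leaps implementation):

```r
# Exhaustive best-subset search (ISLR Algorithm 6.1, steps 1-2), base-R sketch.
predictors <- c("wt", "hp", "disp")
p <- length(predictors)

best_rss  <- numeric(p)          # smallest RSS among models of size k
best_vars <- vector("list", p)   # the predictors in each M_k

for (k in seq_len(p)) {
  subsets <- combn(predictors, k, simplify = FALSE)   # all (p choose k) subsets
  rss <- sapply(subsets, function(vars) {
    f <- reformulate(vars, response = "mpg")
    deviance(lm(f, data = mtcars))                    # deviance() of lm is the RSS
  })
  best_rss[k]   <- min(rss)
  best_vars[[k]] <- subsets[[which.min(rss)]]         # M_k: smallest RSS of size k
}

best_vars[[1]]  # the single best predictor of mpg among the three
```

Because each \(M_{k+1}\) search includes every \(M_k\) plus one variable as a candidate, `best_rss` is nonincreasing in \(k\), which is why step 3 needs a criterion other than RSS.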

Best Subset Selection (2/9)

Load the concrete data set

Variable Information:

Each entry gives the variable name, variable type, measurement unit, and a brief description. Concrete compressive strength is the response for the regression problem; the listing order matches the column order of the data file.

Name – Data Type – Measurement – Description

  • Cement (component 1) – quantitative – kg per m³ of mixture – Input Variable
  • Blast Furnace Slag (component 2) – quantitative – kg per m³ of mixture – Input Variable
  • Fly Ash (component 3) – quantitative – kg per m³ of mixture – Input Variable
  • Water (component 4) – quantitative – kg per m³ of mixture – Input Variable
  • Superplasticizer (component 5) – quantitative – kg per m³ of mixture – Input Variable
  • Coarse Aggregate (component 6) – quantitative – kg per m³ of mixture – Input Variable
  • Fine Aggregate (component 7) – quantitative – kg per m³ of mixture – Input Variable
  • Age – quantitative – Day (1~365) – Input Variable
  • Concrete compressive strength – quantitative – MPa – Output Variable
Code
concrete_column_names <- c("cement", "slag", "ash", "water", "superplasticizer", "coarse_aggregate", "fine_aggregate", "age", "strength")

concrete <- read_xls("datasets/concrete+compressive+strength/Concrete_Data.xls") |> 
  rename_with(~ concrete_column_names)

concrete

Best Subset Selection (3/9)

Code
regfit_full_8 <- regsubsets(strength ~ ., data = concrete, nvmax = 8)

summary(regfit_full_8)
Subset selection object
Call: regsubsets.formula(strength ~ ., data = concrete, nvmax = 8)
8 Variables  (and intercept)
                 Forced in Forced out
cement               FALSE      FALSE
slag                 FALSE      FALSE
ash                  FALSE      FALSE
water                FALSE      FALSE
superplasticizer     FALSE      FALSE
coarse_aggregate     FALSE      FALSE
fine_aggregate       FALSE      FALSE
age                  FALSE      FALSE
1 subsets of each size up to 8
Selection Algorithm: exhaustive
         cement slag ash water superplasticizer coarse_aggregate fine_aggregate
1  ( 1 ) "*"    " "  " " " "   " "              " "              " "           
2  ( 1 ) "*"    " "  " " " "   "*"              " "              " "           
3  ( 1 ) "*"    " "  " " " "   "*"              " "              " "           
4  ( 1 ) "*"    "*"  " " "*"   " "              " "              " "           
5  ( 1 ) "*"    "*"  "*" "*"   " "              " "              " "           
6  ( 1 ) "*"    "*"  "*" "*"   "*"              " "              " "           
7  ( 1 ) "*"    "*"  "*" "*"   "*"              "*"              " "           
8  ( 1 ) "*"    "*"  "*" "*"   "*"              "*"              "*"           
         age
1  ( 1 ) " "
2  ( 1 ) " "
3  ( 1 ) "*"
4  ( 1 ) "*"
5  ( 1 ) "*"
6  ( 1 ) "*"
7  ( 1 ) "*"
8  ( 1 ) "*"

Best Subset Selection (4/9)

Code
regfit_full <- regsubsets(strength ~ ., data = concrete, nvmax = 8, nbest = 256, really.big = TRUE)

summary(regfit_full)
Subset selection object
Call: regsubsets.formula(strength ~ ., data = concrete, nvmax = 8, 
    nbest = 256, really.big = TRUE)
8 Variables  (and intercept)
                 Forced in Forced out
cement               FALSE      FALSE
slag                 FALSE      FALSE
ash                  FALSE      FALSE
water                FALSE      FALSE
superplasticizer     FALSE      FALSE
coarse_aggregate     FALSE      FALSE
fine_aggregate       FALSE      FALSE
age                  FALSE      FALSE
256 subsets of each size up to 8
Selection Algorithm: exhaustive
          cement slag ash water superplasticizer coarse_aggregate
1  ( 1 )  "*"    " "  " " " "   " "              " "             
1  ( 2 )  " "    " "  " " " "   "*"              " "             
1  ( 3 )  " "    " "  " " " "   " "              " "             
1  ( 4 )  " "    " "  " " "*"   " "              " "             
1  ( 5 )  " "    " "  " " " "   " "              " "             
1  ( 6 )  " "    " "  " " " "   " "              "*"             
1  ( 7 )  " "    "*"  " " " "   " "              " "             
1  ( 8 )  " "    " "  "*" " "   " "              " "             
2  ( 1 )  "*"    " "  " " " "   "*"              " "             
2  ( 2 )  "*"    " "  " " " "   " "              " "             
2  ( 3 )  "*"    "*"  " " " "   " "              " "             
2  ( 4 )  "*"    " "  " " "*"   " "              " "             
2  ( 5 )  " "    " "  " " " "   "*"              " "             
2  ( 6 )  " "    " "  " " "*"   " "              " "             
2  ( 7 )  "*"    " "  " " " "   " "              "*"             
2  ( 8 )  "*"    " "  "*" " "   " "              " "             
2  ( 9 )  "*"    " "  " " " "   " "              " "             
2  ( 10 ) " "    " "  "*" " "   "*"              " "             
2  ( 11 ) " "    " "  " " " "   "*"              " "             
2  ( 12 ) " "    " "  " " "*"   " "              " "             
2  ( 13 ) " "    "*"  " " " "   "*"              " "             
2  ( 14 ) " "    " "  " " " "   "*"              "*"             
2  ( 15 ) " "    " "  " " "*"   "*"              " "             
2  ( 16 ) " "    " "  " " " "   " "              "*"             
2  ( 17 ) " "    " "  " " "*"   " "              "*"             
2  ( 18 ) " "    "*"  " " " "   " "              " "             
2  ( 19 ) " "    " "  " " " "   " "              " "             
2  ( 20 ) " "    " "  "*" "*"   " "              " "             
2  ( 21 ) " "    "*"  " " "*"   " "              " "             
2  ( 22 ) " "    " "  "*" " "   " "              " "             
2  ( 23 ) " "    " "  " " " "   " "              "*"             
2  ( 24 ) " "    " "  "*" " "   " "              "*"             
2  ( 25 ) " "    " "  "*" " "   " "              " "             
2  ( 26 ) " "    "*"  " " " "   " "              " "             
2  ( 27 ) " "    "*"  " " " "   " "              "*"             
2  ( 28 ) " "    "*"  "*" " "   " "              " "             
3  ( 1 )  "*"    " "  " " " "   "*"              " "             
3  ( 2 )  "*"    " "  " " "*"   " "              " "             
3  ( 3 )  "*"    "*"  " " " "   "*"              " "             
3  ( 4 )  "*"    "*"  " " " "   " "              " "             
3  ( 5 )  "*"    "*"  " " "*"   " "              " "             
3  ( 6 )  "*"    "*"  "*" " "   " "              " "             
3  ( 7 )  "*"    " "  " " " "   "*"              " "             
3  ( 8 )  " "    " "  " " "*"   " "              " "             
3  ( 9 )  "*"    " "  " " "*"   "*"              " "             
3  ( 10 ) "*"    " "  "*" " "   "*"              " "             
3  ( 11 ) "*"    " "  " " " "   "*"              "*"             
3  ( 12 ) "*"    " "  " " "*"   " "              " "             
3  ( 13 ) " "    " "  "*" " "   "*"              " "             
3  ( 14 ) "*"    " "  "*" " "   " "              " "             
3  ( 15 ) "*"    " "  " " " "   " "              "*"             
3  ( 16 ) " "    " "  " " " "   "*"              " "             
3  ( 17 ) "*"    " "  " " "*"   " "              "*"             
3  ( 18 ) "*"    " "  " " " "   " "              " "             
3  ( 19 ) "*"    "*"  " " " "   " "              " "             
3  ( 20 ) "*"    "*"  " " " "   " "              "*"             
3  ( 21 ) " "    " "  " " "*"   " "              "*"             
3  ( 22 ) " "    " "  " " "*"   "*"              " "             
3  ( 23 ) " "    "*"  " " " "   "*"              " "             
3  ( 24 ) "*"    " "  "*" "*"   " "              " "             
3  ( 25 ) " "    " "  " " "*"   " "              "*"             
3  ( 26 ) " "    "*"  " " "*"   " "              " "             
3  ( 27 ) " "    " "  " " " "   "*"              "*"             
3  ( 28 ) " "    " "  "*" "*"   " "              " "             
3  ( 29 ) " "    " "  "*" " "   "*"              " "             
3  ( 30 ) "*"    " "  "*" " "   " "              "*"             
3  ( 31 ) "*"    " "  " " " "   " "              "*"             
3  ( 32 ) "*"    " "  "*" " "   " "              " "             
3  ( 33 ) " "    " "  "*" "*"   " "              " "             
3  ( 34 ) " "    " "  " " "*"   "*"              " "             
3  ( 35 ) " "    " "  " " " "   "*"              "*"             
3  ( 36 ) " "    " "  "*" "*"   "*"              " "             
3  ( 37 ) " "    " "  "*" " "   "*"              "*"             
3  ( 38 ) " "    "*"  "*" " "   "*"              " "             
3  ( 39 ) " "    "*"  " " "*"   " "              " "             
3  ( 40 ) " "    "*"  " " " "   "*"              " "             
3  ( 41 ) " "    " "  "*" "*"   " "              "*"             
3  ( 42 ) " "    " "  " " " "   " "              "*"             
3  ( 43 ) " "    "*"  " " "*"   "*"              " "             
3  ( 44 ) " "    " "  " " "*"   "*"              "*"             
3  ( 45 ) " "    "*"  " " " "   "*"              "*"             
3  ( 46 ) " "    "*"  " " " "   " "              "*"             
3  ( 47 ) " "    "*"  " " "*"   " "              "*"             
3  ( 48 ) " "    " "  "*" " "   " "              "*"             
3  ( 49 ) " "    "*"  " " " "   " "              " "             
3  ( 50 ) " "    "*"  "*" "*"   " "              " "             
3  ( 51 ) " "    "*"  "*" " "   " "              " "             
3  ( 52 ) " "    " "  "*" " "   " "              " "             
3  ( 53 ) " "    " "  "*" " "   " "              "*"             
3  ( 54 ) " "    "*"  " " " "   " "              "*"             
3  ( 55 ) " "    "*"  "*" " "   " "              "*"             
3  ( 56 ) " "    "*"  "*" " "   " "              " "             
4  ( 1 )  "*"    "*"  " " "*"   " "              " "             
4  ( 2 )  "*"    "*"  " " " "   "*"              " "             
4  ( 3 )  "*"    "*"  "*" " "   " "              " "             
4  ( 4 )  "*"    " "  " " "*"   "*"              " "             
4  ( 5 )  "*"    " "  " " " "   "*"              " "             
4  ( 6 )  "*"    " "  " " "*"   " "              " "             
4  ( 7 )  " "    " "  " " "*"   " "              "*"             
4  ( 8 )  "*"    " "  " " "*"   " "              "*"             
4  ( 9 )  "*"    " "  "*" " "   "*"              " "             
4  ( 10 ) "*"    " "  " " " "   "*"              "*"             
4  ( 11 ) "*"    " "  "*" "*"   " "              " "             
4  ( 12 ) "*"    "*"  "*" "*"   " "              " "             
4  ( 13 ) "*"    "*"  "*" " "   "*"              " "             
4  ( 14 ) "*"    "*"  " " "*"   "*"              " "             
4  ( 15 ) "*"    "*"  " " " "   " "              " "             
4  ( 16 ) "*"    "*"  " " " "   "*"              "*"             
4  ( 17 ) "*"    "*"  " " " "   "*"              " "             
4  ( 18 ) "*"    "*"  "*" " "   " "              " "             
4  ( 19 ) "*"    "*"  " " " "   " "              "*"             
4  ( 20 ) "*"    " "  " " "*"   " "              "*"             
4  ( 21 ) "*"    "*"  " " "*"   " "              " "             
4  ( 22 ) " "    " "  " " "*"   "*"              " "             
4  ( 23 ) "*"    "*"  " " "*"   " "              "*"             
4  ( 24 ) "*"    "*"  "*" " "   " "              "*"             
4  ( 25 ) " "    " "  "*" " "   "*"              " "             
4  ( 26 ) " "    " "  "*" "*"   " "              " "             
4  ( 27 ) "*"    " "  " " "*"   "*"              " "             
4  ( 28 ) " "    "*"  " " "*"   " "              " "             
4  ( 29 ) "*"    " "  "*" " "   "*"              " "             
4  ( 30 ) "*"    " "  " " " "   "*"              "*"             
4  ( 31 ) " "    " "  "*" "*"   "*"              " "             
4  ( 32 ) "*"    " "  "*" " "   " "              "*"             
4  ( 33 ) " "    " "  "*" "*"   " "              "*"             
4  ( 34 ) "*"    " "  " " "*"   "*"              "*"             
4  ( 35 ) "*"    " "  "*" "*"   "*"              " "             
4  ( 36 ) " "    "*"  "*" " "   "*"              " "             
4  ( 37 ) "*"    " "  "*" " "   "*"              "*"             
4  ( 38 ) " "    " "  "*" " "   "*"              "*"             
4  ( 39 ) "*"    " "  "*" "*"   " "              " "             
4  ( 40 ) "*"    " "  "*" " "   " "              " "             
4  ( 41 ) " "    " "  "*" "*"   " "              "*"             
4  ( 42 ) " "    "*"  " " "*"   "*"              " "             
4  ( 43 ) " "    " "  " " " "   "*"              "*"             
4  ( 44 ) " "    "*"  " " " "   "*"              " "             
4  ( 45 ) "*"    " "  " " " "   " "              "*"             
4  ( 46 ) " "    "*"  " " "*"   " "              "*"             
4  ( 47 ) " "    " "  " " "*"   "*"              "*"             
4  ( 48 ) "*"    " "  "*" "*"   " "              "*"             
4  ( 49 ) "*"    "*"  " " " "   " "              "*"             
4  ( 50 ) " "    "*"  " " " "   "*"              "*"             
4  ( 51 ) " "    "*"  "*" "*"   " "              " "             
4  ( 52 ) " "    "*"  " " "*"   " "              "*"             
4  ( 53 ) " "    " "  " " "*"   "*"              "*"             
4  ( 54 ) " "    " "  "*" "*"   "*"              " "             
4  ( 55 ) "*"    " "  "*" " "   " "              "*"             
4  ( 56 ) " "    " "  "*" " "   "*"              "*"             
4  ( 57 ) " "    "*"  "*" " "   "*"              " "             
4  ( 58 ) " "    "*"  " " "*"   "*"              " "             
4  ( 59 ) " "    "*"  "*" "*"   " "              " "             
4  ( 60 ) " "    " "  "*" "*"   "*"              "*"             
4  ( 61 ) " "    "*"  " " " "   "*"              "*"             
4  ( 62 ) " "    "*"  "*" "*"   "*"              " "             
4  ( 63 ) " "    "*"  "*" " "   "*"              "*"             
4  ( 64 ) " "    "*"  "*" "*"   " "              "*"             
4  ( 65 ) " "    "*"  " " "*"   "*"              "*"             
4  ( 66 ) " "    "*"  " " " "   " "              "*"             
4  ( 67 ) " "    " "  "*" " "   " "              "*"             
4  ( 68 ) " "    "*"  "*" " "   " "              "*"             
4  ( 69 ) " "    "*"  "*" " "   " "              " "             
4  ( 70 ) " "    "*"  "*" " "   " "              "*"             
5  ( 1 )  "*"    "*"  "*" "*"   " "              " "             
5  ( 2 )  "*"    "*"  " " "*"   "*"              " "             
5  ( 3 )  "*"    "*"  "*" " "   "*"              " "             
5  ( 4 )  "*"    " "  " " "*"   " "              "*"             
5  ( 5 )  "*"    "*"  " " "*"   " "              "*"             
5  ( 6 )  "*"    "*"  " " "*"   " "              " "             
5  ( 7 )  "*"    "*"  "*" " "   " "              " "             
5  ( 8 )  "*"    "*"  " " " "   "*"              "*"             
5  ( 9 )  "*"    "*"  " " " "   "*"              " "             
5  ( 10 ) "*"    " "  " " "*"   "*"              " "             
5  ( 11 ) "*"    "*"  "*" " "   " "              "*"             
5  ( 12 ) " "    " "  "*" "*"   " "              "*"             
5  ( 13 ) "*"    " "  " " "*"   "*"              "*"             
5  ( 14 ) "*"    " "  "*" "*"   "*"              " "             
5  ( 15 ) "*"    " "  "*" " "   "*"              " "             
5  ( 16 ) "*"    " "  " " " "   "*"              "*"             
5  ( 17 ) "*"    " "  "*" "*"   " "              " "             
5  ( 18 ) " "    "*"  " " "*"   " "              "*"             
5  ( 19 ) " "    " "  " " "*"   "*"              "*"             
5  ( 20 ) "*"    " "  "*" "*"   " "              "*"             
5  ( 21 ) "*"    " "  "*" " "   "*"              "*"             
5  ( 22 ) " "    " "  "*" "*"   "*"              " "             
5  ( 23 ) "*"    "*"  "*" "*"   "*"              " "             
5  ( 24 ) "*"    "*"  "*" "*"   " "              " "             
5  ( 25 ) "*"    "*"  "*" "*"   " "              "*"             
5  ( 26 ) "*"    "*"  "*" " "   "*"              "*"             
5  ( 27 ) "*"    "*"  " " "*"   "*"              " "             
5  ( 28 ) "*"    "*"  "*" " "   "*"              " "             
5  ( 29 ) "*"    "*"  " " "*"   " "              "*"             
5  ( 30 ) "*"    "*"  "*" " "   " "              "*"             
5  ( 31 ) "*"    "*"  " " "*"   "*"              "*"             
5  ( 32 ) "*"    "*"  " " " "   " "              "*"             
5  ( 33 ) "*"    "*"  " " " "   "*"              "*"             
5  ( 34 ) "*"    " "  "*" "*"   " "              "*"             
5  ( 35 ) " "    "*"  " " "*"   "*"              " "             
5  ( 36 ) "*"    " "  " " "*"   "*"              "*"             
5  ( 37 ) " "    " "  "*" " "   "*"              "*"             
5  ( 38 ) " "    "*"  "*" "*"   " "              " "             
5  ( 39 ) " "    "*"  "*" " "   "*"              " "             
5  ( 40 ) "*"    " "  "*" "*"   "*"              " "             
5  ( 41 ) " "    " "  "*" "*"   "*"              "*"             
5  ( 42 ) " "    "*"  "*" "*"   "*"              " "             
5  ( 43 ) " "    "*"  "*" "*"   " "              "*"             
5  ( 44 ) "*"    " "  "*" " "   "*"              "*"             
5  ( 45 ) " "    " "  "*" "*"   "*"              "*"             
5  ( 46 ) "*"    " "  "*" " "   " "              "*"             
5  ( 47 ) " "    "*"  " " "*"   "*"              "*"             
5  ( 48 ) "*"    " "  "*" "*"   "*"              "*"             
5  ( 49 ) " "    "*"  "*" "*"   " "              "*"             
5  ( 50 ) " "    "*"  "*" " "   "*"              "*"             
5  ( 51 ) " "    "*"  " " " "   "*"              "*"             
5  ( 52 ) " "    "*"  " " "*"   "*"              "*"             
5  ( 53 ) " "    "*"  "*" "*"   "*"              " "             
5  ( 54 ) " "    "*"  "*" " "   "*"              "*"             
5  ( 55 ) " "    "*"  "*" "*"   "*"              "*"             
5  ( 56 ) " "    "*"  "*" " "   " "              "*"             
6  ( 1 )  "*"    "*"  "*" "*"   "*"              " "             
6  ( 2 )  "*"    "*"  "*" "*"   " "              " "             
6  ( 3 )  "*"    "*"  "*" "*"   " "              "*"             
6  ( 4 )  "*"    "*"  "*" " "   " "              "*"             
6  ( 5 )  "*"    "*"  " " "*"   " "              "*"             
6  ( 6 )  "*"    "*"  " " "*"   "*"              " "             
6  ( 7 )  "*"    "*"  "*" " "   "*"              "*"             
6  ( 8 )  "*"    "*"  " " "*"   "*"              "*"             
6  ( 9 )  "*"    "*"  "*" " "   "*"              " "             
6  ( 10 ) "*"    " "  "*" "*"   " "              "*"             
6  ( 11 ) "*"    " "  " " "*"   "*"              "*"             
6  ( 12 ) "*"    "*"  " " " "   "*"              "*"             
6  ( 13 ) " "    "*"  "*" "*"   " "              "*"             
6  ( 14 ) "*"    " "  "*" "*"   "*"              " "             
6  ( 15 ) " "    " "  "*" "*"   "*"              "*"             
6  ( 16 ) "*"    " "  "*" "*"   "*"              "*"             
6  ( 17 ) "*"    " "  "*" " "   "*"              "*"             
6  ( 18 ) " "    "*"  " " "*"   "*"              "*"             
6  ( 19 ) " "    "*"  "*" "*"   "*"              " "             
6  ( 20 ) "*"    "*"  "*" "*"   "*"              "*"             
6  ( 21 ) "*"    "*"  "*" "*"   "*"              " "             
6  ( 22 ) "*"    "*"  "*" " "   "*"              "*"             
6  ( 23 ) "*"    "*"  "*" "*"   " "              "*"             
6  ( 24 ) "*"    "*"  " " "*"   "*"              "*"             
6  ( 25 ) "*"    " "  "*" "*"   "*"              "*"             
6  ( 26 ) " "    "*"  "*" " "   "*"              "*"             
6  ( 27 ) " "    "*"  "*" "*"   "*"              "*"             
6  ( 28 ) " "    "*"  "*" "*"   "*"              "*"             
7  ( 1 )  "*"    "*"  "*" "*"   "*"              "*"             
7  ( 2 )  "*"    "*"  "*" "*"   "*"              " "             
7  ( 3 )  "*"    "*"  "*" "*"   " "              "*"             
7  ( 4 )  "*"    "*"  "*" " "   "*"              "*"             
7  ( 5 )  "*"    "*"  " " "*"   "*"              "*"             
7  ( 6 )  "*"    " "  "*" "*"   "*"              "*"             
7  ( 7 )  " "    "*"  "*" "*"   "*"              "*"             
7  ( 8 )  "*"    "*"  "*" "*"   "*"              "*"             
8  ( 1 )  "*"    "*"  "*" "*"   "*"              "*"             
          fine_aggregate age
1  ( 1 )  " "            " "
1  ( 2 )  " "            " "
1  ( 3 )  " "            "*"
1  ( 4 )  " "            " "
1  ( 5 )  "*"            " "
1  ( 6 )  " "            " "
1  ( 7 )  " "            " "
1  ( 8 )  " "            " "
2  ( 1 )  " "            " "
2  ( 2 )  " "            "*"
2  ( 3 )  " "            " "
2  ( 4 )  " "            " "
2  ( 5 )  " "            "*"
2  ( 6 )  " "            "*"
2  ( 7 )  " "            " "
2  ( 8 )  " "            " "
2  ( 9 )  "*"            " "
2  ( 10 ) " "            " "
2  ( 11 ) "*"            " "
2  ( 12 ) "*"            " "
2  ( 13 ) " "            " "
2  ( 14 ) " "            " "
2  ( 15 ) " "            " "
2  ( 16 ) " "            "*"
2  ( 17 ) " "            " "
2  ( 18 ) " "            "*"
2  ( 19 ) "*"            "*"
2  ( 20 ) " "            " "
2  ( 21 ) " "            " "
2  ( 22 ) " "            "*"
2  ( 23 ) "*"            " "
2  ( 24 ) " "            " "
2  ( 25 ) "*"            " "
2  ( 26 ) "*"            " "
2  ( 27 ) " "            " "
2  ( 28 ) " "            " "
3  ( 1 )  " "            "*"
3  ( 2 )  " "            "*"
3  ( 3 )  " "            " "
3  ( 4 )  " "            "*"
3  ( 5 )  " "            " "
3  ( 6 )  " "            " "
3  ( 7 )  "*"            " "
3  ( 8 )  "*"            "*"
3  ( 9 )  " "            " "
3  ( 10 ) " "            " "
3  ( 11 ) " "            " "
3  ( 12 ) "*"            " "
3  ( 13 ) " "            "*"
3  ( 14 ) " "            "*"
3  ( 15 ) " "            "*"
3  ( 16 ) "*"            "*"
3  ( 17 ) " "            " "
3  ( 18 ) "*"            "*"
3  ( 19 ) "*"            " "
3  ( 20 ) " "            " "
3  ( 21 ) " "            "*"
3  ( 22 ) " "            "*"
3  ( 23 ) " "            "*"
3  ( 24 ) " "            " "
3  ( 25 ) "*"            " "
3  ( 26 ) " "            "*"
3  ( 27 ) " "            "*"
3  ( 28 ) " "            "*"
3  ( 29 ) "*"            " "
3  ( 30 ) " "            " "
3  ( 31 ) "*"            " "
3  ( 32 ) "*"            " "
3  ( 33 ) "*"            " "
3  ( 34 ) "*"            " "
3  ( 35 ) "*"            " "
3  ( 36 ) " "            " "
3  ( 37 ) " "            " "
3  ( 38 ) " "            " "
3  ( 39 ) "*"            " "
3  ( 40 ) "*"            " "
3  ( 41 ) " "            " "
3  ( 42 ) "*"            "*"
3  ( 43 ) " "            " "
3  ( 44 ) " "            " "
3  ( 45 ) " "            " "
3  ( 46 ) " "            "*"
3  ( 47 ) " "            " "
3  ( 48 ) " "            "*"
3  ( 49 ) "*"            "*"
3  ( 50 ) " "            " "
3  ( 51 ) " "            "*"
3  ( 52 ) "*"            "*"
3  ( 53 ) "*"            " "
3  ( 54 ) "*"            " "
3  ( 55 ) " "            " "
3  ( 56 ) "*"            " "
4  ( 1 )  " "            "*"
4  ( 2 )  " "            "*"
4  ( 3 )  " "            "*"
4  ( 4 )  " "            "*"
4  ( 5 )  "*"            "*"
4  ( 6 )  "*"            "*"
4  ( 7 )  "*"            "*"
4  ( 8 )  " "            "*"
4  ( 9 )  " "            "*"
4  ( 10 ) " "            "*"
4  ( 11 ) " "            "*"
4  ( 12 ) " "            " "
4  ( 13 ) " "            " "
4  ( 14 ) " "            " "
4  ( 15 ) "*"            "*"
4  ( 16 ) " "            " "
4  ( 17 ) "*"            " "
4  ( 18 ) "*"            " "
4  ( 19 ) " "            "*"
4  ( 20 ) "*"            " "
4  ( 21 ) "*"            " "
4  ( 22 ) "*"            "*"
4  ( 23 ) " "            " "
4  ( 24 ) " "            " "
4  ( 25 ) "*"            "*"
4  ( 26 ) "*"            "*"
4  ( 27 ) "*"            " "
4  ( 28 ) "*"            "*"
4  ( 29 ) "*"            " "
4  ( 30 ) "*"            " "
4  ( 31 ) " "            "*"
4  ( 32 ) " "            "*"
4  ( 33 ) "*"            " "
4  ( 34 ) " "            " "
4  ( 35 ) " "            " "
4  ( 36 ) " "            "*"
4  ( 37 ) " "            " "
4  ( 38 ) " "            "*"
4  ( 39 ) "*"            " "
4  ( 40 ) "*"            "*"
4  ( 41 ) " "            "*"
4  ( 42 ) " "            "*"
4  ( 43 ) "*"            "*"
4  ( 44 ) "*"            "*"
4  ( 45 ) "*"            "*"
4  ( 46 ) " "            "*"
4  ( 47 ) " "            "*"
4  ( 48 ) " "            " "
4  ( 49 ) "*"            " "
4  ( 50 ) " "            "*"
4  ( 51 ) " "            "*"
4  ( 52 ) "*"            " "
4  ( 53 ) "*"            " "
4  ( 54 ) "*"            " "
4  ( 55 ) "*"            " "
4  ( 56 ) "*"            " "
4  ( 57 ) "*"            " "
4  ( 58 ) "*"            " "
4  ( 59 ) "*"            " "
4  ( 60 ) " "            " "
4  ( 61 ) "*"            " "
4  ( 62 ) " "            " "
4  ( 63 ) " "            " "
4  ( 64 ) " "            " "
4  ( 65 ) " "            " "
4  ( 66 ) "*"            "*"
4  ( 67 ) "*"            "*"
4  ( 68 ) " "            "*"
4  ( 69 ) "*"            "*"
4  ( 70 ) "*"            " "
5  ( 1 )  " "            "*"
5  ( 2 )  " "            "*"
5  ( 3 )  " "            "*"
5  ( 4 )  "*"            "*"
5  ( 5 )  " "            "*"
5  ( 6 )  "*"            "*"
5  ( 7 )  "*"            "*"
5  ( 8 )  " "            "*"
5  ( 9 )  "*"            "*"
5  ( 10 ) "*"            "*"
5  ( 11 ) " "            "*"
5  ( 12 ) "*"            "*"
5  ( 13 ) " "            "*"
5  ( 14 ) " "            "*"
5  ( 15 ) "*"            "*"
5  ( 16 ) "*"            "*"
5  ( 17 ) "*"            "*"
5  ( 18 ) "*"            "*"
5  ( 19 ) "*"            "*"
5  ( 20 ) " "            "*"
5  ( 21 ) " "            "*"
5  ( 22 ) "*"            "*"
5  ( 23 ) " "            " "
5  ( 24 ) "*"            " "
5  ( 25 ) " "            " "
5  ( 26 ) " "            " "
5  ( 27 ) "*"            " "
5  ( 28 ) "*"            " "
5  ( 29 ) "*"            " "
5  ( 30 ) "*"            " "
5  ( 31 ) " "            " "
5  ( 32 ) "*"            "*"
5  ( 33 ) "*"            " "
5  ( 34 ) "*"            " "
5  ( 35 ) "*"            "*"
5  ( 36 ) "*"            " "
5  ( 37 ) "*"            "*"
5  ( 38 ) "*"            "*"
5  ( 39 ) "*"            "*"
5  ( 40 ) "*"            " "
5  ( 41 ) " "            "*"
5  ( 42 ) " "            "*"
5  ( 43 ) "*"            " "
5  ( 44 ) "*"            " "
5  ( 45 ) "*"            " "
5  ( 46 ) "*"            "*"
5  ( 47 ) " "            "*"
5  ( 48 ) " "            " "
5  ( 49 ) " "            "*"
5  ( 50 ) " "            "*"
5  ( 51 ) "*"            "*"
5  ( 52 ) "*"            " "
5  ( 53 ) "*"            " "
5  ( 54 ) "*"            " "
5  ( 55 ) " "            " "
5  ( 56 ) "*"            "*"
6  ( 1 )  " "            "*"
6  ( 2 )  "*"            "*"
6  ( 3 )  " "            "*"
6  ( 4 )  "*"            "*"
6  ( 5 )  "*"            "*"
6  ( 6 )  "*"            "*"
6  ( 7 )  " "            "*"
6  ( 8 )  " "            "*"
6  ( 9 )  "*"            "*"
6  ( 10 ) "*"            "*"
6  ( 11 ) "*"            "*"
6  ( 12 ) "*"            "*"
6  ( 13 ) "*"            "*"
6  ( 14 ) "*"            "*"
6  ( 15 ) "*"            "*"
6  ( 16 ) " "            "*"
6  ( 17 ) "*"            "*"
6  ( 18 ) "*"            "*"
6  ( 19 ) "*"            "*"
6  ( 20 ) " "            " "
6  ( 21 ) "*"            " "
6  ( 22 ) "*"            " "
6  ( 23 ) "*"            " "
6  ( 24 ) "*"            " "
6  ( 25 ) "*"            " "
6  ( 26 ) "*"            "*"
6  ( 27 ) " "            "*"
6  ( 28 ) "*"            " "
7  ( 1 )  " "            "*"
7  ( 2 )  "*"            "*"
7  ( 3 )  "*"            "*"
7  ( 4 )  "*"            "*"
7  ( 5 )  "*"            "*"
7  ( 6 )  "*"            "*"
7  ( 7 )  "*"            "*"
7  ( 8 )  "*"            " "
8  ( 1 )  "*"            "*"

Best Subset Selection (5/9)

A quick comparison of two models

Code
summary(lm(strength ~ cement, data = concrete))

Call:
lm(formula = strength ~ cement, data = concrete)

Residuals:
    Min      1Q  Median      3Q     Max 
-40.594 -10.952  -0.572   9.992  43.241 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 13.442795   1.296925   10.37   <2e-16 ***
cement       0.079580   0.004324   18.41   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14.5 on 1028 degrees of freedom
Multiple R-squared:  0.2478,    Adjusted R-squared:  0.2471 
F-statistic: 338.7 on 1 and 1028 DF,  p-value: < 2.2e-16
Code
summary(lm(strength ~ age, data = concrete))

Call:
lm(formula = strength ~ age, data = concrete)

Residuals:
    Min      1Q  Median      3Q     Max 
-38.509 -11.290  -1.516   9.426  47.469 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 31.846436   0.606949   52.47   <2e-16 ***
age          0.086974   0.007789   11.17   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15.78 on 1028 degrees of freedom
Multiple R-squared:  0.1082,    Adjusted R-squared:  0.1073 
F-statistic: 124.7 on 1 and 1028 DF,  p-value: < 2.2e-16
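The same head-to-head comparison can be scripted rather than read off the two summaries. A base-R sketch, shown on the built-in mtcars data rather than the concrete data:

```r
# Compare two single-predictor models by R^2 pulled directly from summary().
fit_wt   <- lm(mpg ~ wt,   data = mtcars)
fit_qsec <- lm(mpg ~ qsec, data = mtcars)

c(wt = summary(fit_wt)$r.squared, qsec = summary(fit_qsec)$r.squared)
```

For models with the same number of predictors, ranking by \(R^2\) and ranking by RSS give the same ordering, which is exactly what best subset selection exploits within each size \(k\).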

Best Subset Selection (6/9)

What’s available in the regsubsets() function?

(You will need to review the documentation for details.)

Code
names(summary(regfit_full))
[1] "which"  "rsq"    "rss"    "adjr2"  "cp"     "bic"    "outmat" "obj"   
  • which: A logical matrix indicating which predictors are in each model

  • rsq: The \(R^2\) for each model

  • rss: The residual sum of squares for each model

  • adjr2: The adjusted \(R^2\) for each model

  • cp: Mallows’ \(C_p\)

  • bic: Schwarz’s information criterion, BIC

  • outmat: A version of the which component formatted for printing

  • obj: A copy of the regsubsets object
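The adjr2 component reports the same quantity lm() computes per model. As a sanity check, it can be reproduced by hand from \(R^2\), \(n\), and the number of predictors \(d\); a base-R sketch on mtcars (not the concrete data):

```r
# Adjusted R^2 by hand: 1 - (1 - R^2) * (n - 1) / (n - d - 1),
# checked against the value summary.lm() reports.
fit <- lm(mpg ~ wt + hp, data = mtcars)
n <- nrow(mtcars)   # number of observations
d <- 2              # number of predictors in this model
r2 <- summary(fit)$r.squared
adjr2 <- 1 - (1 - r2) * (n - 1) / (n - d - 1)
```

Unlike \(R^2\), this quantity can decrease when a useless predictor is added, which is why it can be used in step 3 to compare models of different sizes.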

Best Subset Selection (7/9)

Code
concrete_subset_adjr2 <- as_tibble(summary(regfit_full)$which, rownames = "model") |> 
  select(model) |> 
  bind_cols(as_tibble_col(summary(regfit_full)$adjr2, column_name = "adjr2")) |> 
  mutate(model = as_factor(model))

concrete_model_max_adjr2 <- concrete_subset_adjr2 |> 
  group_by(model) |>
  summarise(adjr2 = max(adjr2)) |>
  ungroup()

concrete_subset_adjr2 |>
  ggplot() +
  geom_point(aes(x = model, y = adjr2), alpha = 1/5) +
  geom_line(aes(x = model, y = adjr2, group = 1),
            data = concrete_model_max_adjr2, 
            color = "red") +
  geom_point(aes(x = model, y = adjr2),
            data = concrete_model_max_adjr2, 
            color = "red") +
  labs(
    title = "Best Subset Selection for Concrete Data",
    subtitle = "Adjusted *R<sup>2</sup>* for each model",
    x = "Number of Predictors",
    y = NULL
  ) +
  theme_light() +
  theme(
   plot.title = ggtext::element_markdown(size = 16),
   plot.title.position = "plot",
   plot.subtitle = ggtext::element_markdown(size = 12)
  )

Best Subset Selection (8/9)

Focus on best models for each number of predictors

Code
regfit_full_8 <- regsubsets(strength ~ ., data = concrete, nvmax = 8, really.big = FALSE)

summary(regfit_full_8)
Subset selection object
Call: regsubsets.formula(strength ~ ., data = concrete, nvmax = 8, 
    really.big = FALSE)
8 Variables  (and intercept)
                 Forced in Forced out
cement               FALSE      FALSE
slag                 FALSE      FALSE
ash                  FALSE      FALSE
water                FALSE      FALSE
superplasticizer     FALSE      FALSE
coarse_aggregate     FALSE      FALSE
fine_aggregate       FALSE      FALSE
age                  FALSE      FALSE
1 subsets of each size up to 8
Selection Algorithm: exhaustive
         cement slag ash water superplasticizer coarse_aggregate fine_aggregate
1  ( 1 ) "*"    " "  " " " "   " "              " "              " "           
2  ( 1 ) "*"    " "  " " " "   "*"              " "              " "           
3  ( 1 ) "*"    " "  " " " "   "*"              " "              " "           
4  ( 1 ) "*"    "*"  " " "*"   " "              " "              " "           
5  ( 1 ) "*"    "*"  "*" "*"   " "              " "              " "           
6  ( 1 ) "*"    "*"  "*" "*"   "*"              " "              " "           
7  ( 1 ) "*"    "*"  "*" "*"   "*"              "*"              " "           
8  ( 1 ) "*"    "*"  "*" "*"   "*"              "*"              "*"           
         age
1  ( 1 ) " "
2  ( 1 ) " "
3  ( 1 ) "*"
4  ( 1 ) "*"
5  ( 1 ) "*"
6  ( 1 ) "*"
7  ( 1 ) "*"
8  ( 1 ) "*"

Best Subset Selection (9/9)

Code
as_tibble(summary(regfit_full_8)$which, rownames = "model") |> 
  mutate(
    adjr2 = summary(regfit_full_8)$adjr2,
    model = as_factor(model),
    ) |>
  select(-`(Intercept)`) |>
  pivot_longer(cols = 2:9, names_to = "variable", values_to = "included") |> 
  mutate(variable = str_replace(variable, "_", " ")) |> 
  mutate(variable = reorder(as_factor(variable), adjr2)) |> 
  ggplot() +
  geom_tile(aes(x = model, y = variable, fill = included), color = "white") +
  labs(
    title = "Best Subset Selection for Concrete Data",
    subtitle = "Variables available for each model ordered by adjusted *R<sup>2</sup>*",
    x = "Number of Predictors",
    y = NULL,
    fill = "Variable included \nin the model?"
  ) +
  scale_fill_manual(values = c("red", "darkgrey")) +
  scale_x_discrete() +
  theme_minimal() +
  theme(
   plot.title = ggtext::element_markdown(size = 16),
   plot.title.position = "plot",
   plot.subtitle = ggtext::element_markdown(size = 12)
  ) 

Subset selection is a useful tool for selecting the best model for a given data set; however, it can be computationally expensive.

If the number of predictors is large, the chance of overfitting the model also increases.

Stepwise Selection

We can use stepwise methods to evaluate models when the number of predictors is large.

Forward Selection

Forward selection starts with the null model and adds predictors one at a time. The number of models fit is

\[ 1 + p(p+1)/2 \]

For our example of \(p = 8\) predictors, this would be \(1 + 8(8+1)/2 = 37\) models instead of the \(2^8 = 256\) models for best subset selection.
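A quick sanity check of these counts in base R (the variable names here are illustrative):

```r
# Compare the number of models fit by best subset selection (2^p)
# versus forward stepwise selection (1 + p(p+1)/2) as p grows.
p <- 1:8
best_subset <- 2^p                 # all possible subsets
forward     <- 1 + p * (p + 1) / 2 # null model plus p - k candidates at each step

data.frame(p, best_subset, forward)
```

For \(p = 8\) this reproduces the 256 versus 37 comparison above; the gap widens rapidly as \(p\) increases.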

Backward Selection

Backward selection starts with the full model and removes predictors one at a time. The number of models fit is the same as for forward selection.

Stepwise Selection Algorithms

Forward Selection

  1. Start with the null model, \(M_0\), which contains no predictors.
  2. For \(k=0,1,\ldots,p-1\):
    1. Consider all \(p-k\) models that augment the predictors in \(M_k\) with one additional predictor.
    2. Choose the best among these \(p-k\) models, and call it \(M_{k+1}\). Here best is defined as having the smallest RSS or highest \(R^2\).
  3. Select a single best model from among \(M_0, M_1, \ldots, M_p\) using cross-validated prediction error, \(C_p\), BIC, or adjusted \(R^2\).

Backward Selection

  1. Start with the full model, \(M_p\), which contains all predictors.
  2. For \(k=p,p-1,\ldots,1\):
    1. Consider all \(k\) models that contain all but one of the predictors in \(M_k\), for a total of \(k-1\) predictors.
    2. Choose the best among these \(k\) models, and call it \(M_{k-1}\). Here best is defined as having the smallest RSS or highest \(R^2\).
  3. Select a single best model from among \(M_0, M_1, \ldots, M_p\) using cross-validated prediction error, \(C_p\), BIC, or adjusted \(R^2\).
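The forward selection algorithm above can be sketched in a few lines of base R. This is an illustrative implementation, not the slides’ regsubsets() workflow; it uses the built-in mtcars data so it is self-contained, and forward_select and rss are hypothetical helper names.

```r
# Residual sum of squares for a fitted model
rss <- function(fit) sum(resid(fit)^2)

# Minimal forward selection: at each step, add the predictor
# that most reduces the training RSS (steps 1-2 of the algorithm).
forward_select <- function(data, response) {
  predictors <- setdiff(names(data), response)
  chosen <- character(0)
  models <- list(lm(reformulate("1", response), data = data))  # M_0: null model
  for (k in seq_along(predictors)) {
    candidates <- setdiff(predictors, chosen)
    fits <- lapply(candidates, function(v)
      lm(reformulate(c(chosen, v), response), data = data))
    best <- which.min(vapply(fits, rss, numeric(1)))
    chosen <- c(chosen, candidates[best])
    models[[k + 1]] <- fits[[best]]
  }
  list(order = chosen, models = models)
}

fs <- forward_select(mtcars, "mpg")
fs$order  # predictors in the order they entered the model
```

Step 3 (choosing among \(M_0, \ldots, M_p\)) would then use cross-validation or one of the criteria discussed below, since training RSS always improves as predictors are added.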

Stepwise Selection: Forward Selection

Code
regfit_forward <- regsubsets(strength ~ ., data = concrete, nvmax = 8, method = "forward")

summary(regfit_forward)
Subset selection object
Call: regsubsets.formula(strength ~ ., data = concrete, nvmax = 8, 
    method = "forward")
8 Variables  (and intercept)
                 Forced in Forced out
cement               FALSE      FALSE
slag                 FALSE      FALSE
ash                  FALSE      FALSE
water                FALSE      FALSE
superplasticizer     FALSE      FALSE
coarse_aggregate     FALSE      FALSE
fine_aggregate       FALSE      FALSE
age                  FALSE      FALSE
1 subsets of each size up to 8
Selection Algorithm: forward
         cement slag ash water superplasticizer coarse_aggregate fine_aggregate
1  ( 1 ) "*"    " "  " " " "   " "              " "              " "           
2  ( 1 ) "*"    " "  " " " "   "*"              " "              " "           
3  ( 1 ) "*"    " "  " " " "   "*"              " "              " "           
4  ( 1 ) "*"    "*"  " " " "   "*"              " "              " "           
5  ( 1 ) "*"    "*"  " " "*"   "*"              " "              " "           
6  ( 1 ) "*"    "*"  "*" "*"   "*"              " "              " "           
7  ( 1 ) "*"    "*"  "*" "*"   "*"              "*"              " "           
8  ( 1 ) "*"    "*"  "*" "*"   "*"              "*"              "*"           
         age
1  ( 1 ) " "
2  ( 1 ) " "
3  ( 1 ) "*"
4  ( 1 ) "*"
5  ( 1 ) "*"
6  ( 1 ) "*"
7  ( 1 ) "*"
8  ( 1 ) "*"

Stepwise Selection: Backward Selection

Code
regfit_backward <- regsubsets(strength ~ ., data = concrete, nvmax = 8, method = "backward")

summary(regfit_backward)
Subset selection object
Call: regsubsets.formula(strength ~ ., data = concrete, nvmax = 8, 
    method = "backward")
8 Variables  (and intercept)
                 Forced in Forced out
cement               FALSE      FALSE
slag                 FALSE      FALSE
ash                  FALSE      FALSE
water                FALSE      FALSE
superplasticizer     FALSE      FALSE
coarse_aggregate     FALSE      FALSE
fine_aggregate       FALSE      FALSE
age                  FALSE      FALSE
1 subsets of each size up to 8
Selection Algorithm: backward
         cement slag ash water superplasticizer coarse_aggregate fine_aggregate
1  ( 1 ) "*"    " "  " " " "   " "              " "              " "           
2  ( 1 ) "*"    " "  " " " "   " "              " "              " "           
3  ( 1 ) "*"    " "  " " "*"   " "              " "              " "           
4  ( 1 ) "*"    "*"  " " "*"   " "              " "              " "           
5  ( 1 ) "*"    "*"  "*" "*"   " "              " "              " "           
6  ( 1 ) "*"    "*"  "*" "*"   "*"              " "              " "           
7  ( 1 ) "*"    "*"  "*" "*"   "*"              "*"              " "           
8  ( 1 ) "*"    "*"  "*" "*"   "*"              "*"              "*"           
         age
1  ( 1 ) " "
2  ( 1 ) "*"
3  ( 1 ) "*"
4  ( 1 ) "*"
5  ( 1 ) "*"
6  ( 1 ) "*"
7  ( 1 ) "*"
8  ( 1 ) "*"

Comparing Models for Subset Selection Methods

Choosing the Optimal Model

Important

The model containing all of the predictors will always have the smallest \(RSS\) and the largest \(R^2\).

Recall, we want to choose a model based on its ability to predict new data.

In the above examples, we didn’t create a test set to evaluate the models; we simply used the entire dataset.

Unfortunately, the regsubsets() function does not have a built-in method for evaluating models on a test set.

Metrics for Model Selection (1/2)

RSS: Residual Sum of Squares

\[ RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \]

\(R^2\): Coefficient of Determination

\[ R^2 = 1 - \frac{RSS}{TSS} \]

\(TSS = \sum_{i=1}^{n} (y_i - \bar{y})^2\) is the Total Sum of Squares.

MSE: Mean Squared Error

\[ MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \frac{RSS}{n} \]
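These three quantities are straightforward to compute from their definitions; a base-R sketch using the built-in mtcars data (any lm fit would do):

```r
# Compute RSS, TSS, R^2, and MSE by hand and check R^2 against summary()
fit <- lm(mpg ~ wt, data = mtcars)

rss <- sum(resid(fit)^2)
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
r2  <- 1 - rss / tss
mse <- rss / nrow(mtcars)

all.equal(r2, summary(fit)$r.squared)  # TRUE
```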

Metrics for Model Selection (2/2)

Adjusted \(R^2\):

\[ \textrm{Adjusted }R^2 = 1 - \frac{RSS/(n-d-1)}{TSS/(n-1)} \]

Mallows’ \(C_p\):

\[ C_p = \frac{1}{n} (RSS + 2d\hat{\sigma}^2) \]

Akaike Information Criterion (AIC):

\[ AIC = \frac{1}{n} (RSS + 2d\hat{\sigma}^2) \]

(up to irrelevant constants; for least-squares models, \(AIC\) is proportional to \(C_p\), so the two criteria select the same model)

Bayesian Information Criterion (BIC):

\[ BIC = \frac{1}{n} (RSS + \log(n)d\hat{\sigma}^2) \]

where \(d\) is the number of predictors in the model and \(\hat{\sigma}^2\) is an estimate of the variance of the error term, typically obtained from the full model. Because \(\log(n) > 2\) for \(n > 7\), BIC generally places a heavier penalty than \(C_p\) on models with many predictors.
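A sketch computing adjusted \(R^2\) and \(C_p\) from these definitions, using base R and the built-in mtcars data; \(\hat{\sigma}^2\) is estimated from the full model:

```r
# Estimate sigma^2 from the full model, then evaluate a smaller model
full <- lm(mpg ~ ., data = mtcars)
sigma2_hat <- summary(full)$sigma^2

fit <- lm(mpg ~ wt + hp, data = mtcars)
n <- nrow(mtcars)
d <- 2  # number of predictors in `fit`
rss <- sum(resid(fit)^2)
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)

adj_r2 <- 1 - (rss / (n - d - 1)) / (tss / (n - 1))
cp     <- (rss + 2 * d * sigma2_hat) / n

all.equal(adj_r2, summary(fit)$adj.r.squared)  # TRUE
```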

Working with Combinations

Walk through the following code to examine how we can work with combinations of predictors.

Code
# Initialize an empty list to store model information
model_list <- list()

total_combinations <- 0

for (i in 1:8) {
  concrete_predictors <- combn(concrete_column_names[1:8], i)
  num_combinations <- ncol(concrete_predictors)
  
  cat(sprintf("\nNumber of predictors: %d\n", i))
  cat(sprintf("Number of combinations: %d\n", num_combinations))
  
  total_combinations <- total_combinations + num_combinations
  
  for (j in 1:num_combinations) {
    formula <- reformulate(concrete_predictors[,j], response = "strength")
    model_combination <- lm(formula, data = concrete)
    
    # Add model information to the list
    model_list[[length(model_list) + 1]] <- list(
      num_predictors = i,
      combination = paste(concrete_predictors[,j], collapse = ", "),
      formula = paste(deparse(formula), collapse = " "),  # Ensure single string
      model = model_combination
    )
  }
}

Number of predictors: 1
Number of combinations: 8

Number of predictors: 2
Number of combinations: 28

Number of predictors: 3
Number of combinations: 56

Number of predictors: 4
Number of combinations: 70

Number of predictors: 5
Number of combinations: 56

Number of predictors: 6
Number of combinations: 28

Number of predictors: 7
Number of combinations: 8

Number of predictors: 8
Number of combinations: 1
Code
# Convert the list to a tibble
model_results <- tibble::tibble(
  num_predictors = map_int(model_list, "num_predictors"),
  combination = map_chr(model_list, "combination"),
  formula = map_chr(model_list, "formula"),
  model = map(model_list, "model")
)

cat(sprintf("\nTotal number of combinations: %d\n", total_combinations))

Total number of combinations: 255
Code
cat(sprintf("Number of rows in model_results: %d\n", nrow(model_results)))
Number of rows in model_results: 255
Code
# Display the resulting tibble
model_results

Custom Function to Split Data and Compare Models

In many cases, you’ll be able to use package functions to perform model selection.

Code
# Created with the help of Claude.ai

compare_models <- function(data, response_var, predictor_vars, proportion = 0.8, seed = NULL) {
  # Set seed if provided
  if (!is.null(seed)) set.seed(seed)
  
  # Capture the predictor variables
  pred_vars <- enquo(predictor_vars)
  
  # Select the specified columns
  selected_data <- data |> 
    select(!!response_var, !!pred_vars)
  
  # Get column names
  all_column_names <- colnames(selected_data)
  response_name <- as.character(response_var)
  predictor_names <- setdiff(all_column_names, response_name)
  
  # Create train/test split
  n <- nrow(selected_data)
  train_indices <- sample(1:n, size = round(proportion * n))
  train_data <- selected_data[train_indices, ]
  test_data <- selected_data[-train_indices, ]
  
  # Initialize an empty list to store model information
  model_list <- list()
  total_combinations <- 0
  
  for (i in 1:length(predictor_names)) {
    predictor_combinations <- combn(predictor_names, i)
    num_combinations <- ncol(predictor_combinations)
    
    total_combinations <- total_combinations + num_combinations
    
    for (j in 1:num_combinations) {
      formula <- reformulate(predictor_combinations[,j], response = response_name)
      model_combination <- lm(formula, data = train_data)
      
      # Make predictions on test data
      predictions <- predict(model_combination, newdata = test_data)
      
      # Calculate RMSE and R-squared on test data
      rmse <- sqrt(mean((test_data[[response_name]] - predictions)^2))
      rsq <- cor(test_data[[response_name]], predictions)^2
      
      # Add model information to the list
      model_list[[length(model_list) + 1]] <- list(
        num_predictors = i,
        combination = paste(predictor_combinations[,j], collapse = ", "),
        formula = paste(deparse(formula), collapse = " "),
        model = model_combination,
        test_rmse = rmse,
        test_rsq = rsq
      )
    }
  }
  
  # Convert the list to a tibble
  model_results <- tibble::tibble(
    num_predictors = map_int(model_list, "num_predictors"),
    combination = map_chr(model_list, "combination"),
    formula = map_chr(model_list, "formula"),
    model = map(model_list, "model"),
    test_rmse = map_dbl(model_list, "test_rmse"),
    test_rsq = map_dbl(model_list, "test_rsq")
  )
  
  cat(sprintf("\nTotal number of models: %d\n", nrow(model_results)))
  
  return(model_results)
}

# Example usage:
result <- compare_models(concrete, "strength", concrete_column_names, proportion = 0.8, seed = 42)

Total number of models: 255
Code
# View results:
# result |>  arrange(test_rmse) |>  head(10)  # Top 10 models by RMSE
result |>  arrange(desc(test_rsq)) |>  head(10)  # Top 10 models by R-squared

Test Model Results

Code
max_results <- result |> 
  group_by(num_predictors) |> 
  filter(test_rsq == max(test_rsq)) |> 
  ungroup()

result |> 
  ggplot() +
  geom_point(aes(x = num_predictors, y = test_rsq), alpha = 1/3) +
  geom_line(aes(x = num_predictors, y = test_rsq), data = max_results, color = "red") +
  geom_point(aes(x = num_predictors, y = test_rsq), data = max_results, color = "red") +
  labs(
    title = "Model Selection Using Subset Selection on Test Data",
    subtitle = "Custom function to compare models based on test data",
    x = "Number of Predictors",
    y = "R<sup>2</sup>"
  ) +
  scale_x_continuous(breaks = 1:8) +
  theme_light() +
  theme(
   plot.title = ggtext::element_markdown(size = 16),
   plot.title.position = "plot",
   plot.subtitle = ggtext::element_markdown(size = 12),
   axis.title = ggtext::element_markdown(size = 10)
  )

Assignment: Subset Selection Methods

ISLR Lab: 6.5.1 Subset Selection Methods

Shrinkage Methods

Ridge Regression

Lasso

Elastic Net

Selecting the Tuning Parameter

Ridge Regression

Recall that linear regression minimizes RSS to find the best-fitting line.

\[ RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} \left (y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right )^2 \]

Ridge regression adds a penalty term to the RSS to shrink the coefficients towards zero.

\[ \sum_{i=1}^{n} \left (y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right )^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \]

\(\lambda \ge 0\) is a tuning parameter that controls the amount of shrinkage; its value is typically chosen using cross-validation.

Note

We will use the glmnet package to perform ridge regression.

Ridge Regression Penalty Term

The penalty term is often referred to as \(\ell_2\) regularization and has the shorthand notation

\[ || \beta ||_2^2 = \sum_{j=1}^{p} \beta_j^2 \]

Scaling the Data

Standardizing the Predictors (ISLR, 239)

The ridge regression coefficient estimates can change substantially when multiplying a given predictor by a constant.
\(X_j \hat{\beta}_{j,\lambda}^R\) will depend not only on the value of \(\lambda\), but also on the scaling of the \(j\)th predictor. . . the value of \(X_j \hat{\beta}_{j,\lambda}^R\) may even depend on the scaling of the other predictors! Therefore, it is best to apply ridge regression after standardizing the predictors, using the formula

\[ \tilde{x}_{ij} = \frac{x_{ij}}{\sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2}} \]

Standardizing the Predictors (more common method)

Alternatively, we can use the scale() function in R to center and scale the predictors.
This is often referred to as z-score scaling.

\[ \tilde{x}_{ij} = \frac{x_{ij} - \bar{x}_j}{\sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2}} \]
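With standardized predictors, the ridge solution can also be written in closed form, \(\hat{\beta}^R_\lambda = (X^TX + \lambda I)^{-1}X^Ty\). The base-R sketch below illustrates the shrinkage on the built-in mtcars data; this is not how glmnet computes its fits (it uses a faster algorithm and its own \(\lambda\) scaling), and ridge_coef is an illustrative helper name.

```r
# Closed-form ridge regression on z-scored predictors.
# Centering y makes the intercept simply mean(y), so it can be dropped here.
X <- scale(as.matrix(mtcars[, c("wt", "hp", "disp")]))
y <- mtcars$mpg - mean(mtcars$mpg)

ridge_coef <- function(X, y, lambda) {
  p <- ncol(X)
  solve(t(X) %*% X + lambda * diag(p), t(X) %*% y)
}

b0   <- ridge_coef(X, y, 0)    # lambda = 0 reproduces least squares
b100 <- ridge_coef(X, y, 100)  # larger lambda shrinks coefficients toward zero
cbind(ols = b0, ridge = b100)
```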

Ridge Regression Example (1/2)

Code
concrete_ridge <- glmnet(x = scale(as.matrix(concrete[, 1:8])), y = concrete$strength, alpha = 0, standardize = FALSE)

# names(concrete_ridge)
# concrete_ridge$lambda

as_tibble(as.matrix(concrete_ridge$beta), rownames = "predictor") |> 
  pivot_longer(cols = 2:101, names_to = "lambda", values_to = "beta_values") |> 
  mutate(lambda_values = rep(concrete_ridge$lambda, times = 8)) |> 
  ggplot() +
  geom_line(aes(x = lambda_values, y = beta_values, color = predictor)) +
  labs(
    title = "Ridge Regression Coefficients",
    subtitle = "Coefficient values for each predictor over different lambda values",
    x = TeX(r'($\lambda$)'),
    y = "Coefficient Value"
  ) +
  scale_x_log10() +
  theme_light()

Ridge Regression Example (2/2)

With Cross-Validation

Code
lambda_values <- 10^seq(3, -1, length = 100)

concrete_ridge_cv <- cv.glmnet(x = scale(as.matrix(concrete[, 1:8])), y = concrete$strength, alpha = 0, standardize = FALSE, nfolds = 10, lambda = lambda_values)

ggplot() +
  geom_point(aes(x = concrete_ridge_cv$lambda, y = concrete_ridge_cv$cvm)) +
  labs(
    title = "Ridge Regression Cross-Validation",
    subtitle = "Cross-validated mean squared error for different lambda values",
    x = TeX(r'($\lambda$)'),
    y = "Mean Squared Error"
  ) +
  scale_x_log10(labels = scales::label_comma(accuracy = 1)) +
  theme_light()

Code
plot(concrete_ridge_cv)

The Lasso

Lasso stands for Least Absolute Shrinkage and Selection Operator.

The lasso penalty term is \(\ell_1\) regularization.

\[ \sum_{i=1}^{n} \left (y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right )^2 + \lambda \sum_{j=1}^{p} | \beta_j | \]

Similar to Ridge regression, \(\lambda \ge 0\) is a tuning parameter that controls the amount of shrinkage and is determined using cross-validation.

Unlike Ridge regression, the lasso can set coefficients to zero, effectively performing variable selection.

The data should be standardized before using the lasso.
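Why the lasso produces exact zeros can be seen in the coordinate-descent update it admits: each coefficient is "soft-thresholded", and small updates snap exactly to zero. The sketch below is a minimal illustration on built-in mtcars data, assuming standardized predictors; glmnet uses a heavily optimized version of this idea, and lasso_cd and soft_threshold are illustrative names.

```r
# Soft-threshold operator: shrinks toward zero, snapping small values to 0
soft_threshold <- function(z, gamma) sign(z) * pmax(abs(z) - gamma, 0)

# Naive coordinate descent for (1/2n)||y - X beta||^2 + lambda * ||beta||_1
lasso_cd <- function(X, y, lambda, iters = 200) {
  n <- nrow(X); p <- ncol(X)
  beta <- rep(0, p)
  for (it in 1:iters) {
    for (j in 1:p) {
      r_j <- y - X[, -j, drop = FALSE] %*% beta[-j]  # partial residual
      z   <- sum(X[, j] * r_j) / n
      beta[j] <- soft_threshold(z, lambda) / (sum(X[, j]^2) / n)
    }
  }
  beta
}

X <- scale(as.matrix(mtcars[, c("wt", "hp", "disp", "drat")]))
y <- mtcars$mpg - mean(mtcars$mpg)
round(lasso_cd(X, y, lambda = 2), 3)  # large lambda zeroes out weak predictors
```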

Lasso Example (1/2)

Code
concrete_lasso <- glmnet(x = scale(as.matrix(concrete[, 1:8])), y = concrete$strength, alpha = 1, standardize = FALSE)

as_tibble(as.matrix(concrete_lasso$beta), rownames = "predictor") |> 
  pivot_longer(cols = 2:78, names_to = "lambda", values_to = "beta_values") |> 
  mutate(lambda_values = rep(concrete_lasso$lambda, times = 8)) |> 
  ggplot() +
  geom_line(aes(x = lambda_values, y = beta_values, color = predictor)) +
  labs(
    title = "Lasso Coefficients",
    subtitle = "Coefficient values for each predictor over different lambda values",
    x = TeX(r'($\lambda$)'),
    y = "Coefficient Value"
  ) +
  scale_x_log10() +
  theme_light()

Lasso Example (2/2)

With Cross-Validation

Code
lambda_values <- 10^seq(2, -1, length = 100)

concrete_lasso_cv <- cv.glmnet(x = scale(as.matrix(concrete[, 1:8])), y = concrete$strength, alpha = 1, standardize = FALSE, nfolds = 10, lambda = lambda_values)

ggplot() +
  geom_point(aes(x = concrete_lasso_cv$lambda, y = concrete_lasso_cv$cvm)) +
  labs(
    title = "Lasso Cross-Validation",
    subtitle = "Cross-validated mean squared error for different lambda values",
    x = TeX(r'($\lambda$)'),
    y = "Mean Squared Error"
  ) +
  scale_x_log10() +
  theme_light()

Code
plot(concrete_lasso_cv)

Elastic Net

Elastic Net is a combination of Ridge and Lasso regression.

\[ \sum_{i=1}^{n} \left (y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right )^2 + \lambda \left ( \alpha \sum_{j=1}^{p} | \beta_j | + (1 - \alpha) \sum_{j=1}^{p} \beta_j^2 \right ) \]

where \(\alpha\) controls the balance between the ridge and lasso penalties.

\(\alpha = 1\) is lasso regression and is the default in glmnet; \(\alpha = 0\) is ridge regression. (glmnet’s own objective additionally scales the ridge term by a factor of \(1/2\).)
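To make the roles of \(\alpha\) and \(\lambda\) concrete, a small sketch evaluating the penalty term by hand (enet_penalty is an illustrative helper, not a glmnet function):

```r
# Elastic-net penalty as written above: a blend of the l1 and l2 penalties
enet_penalty <- function(beta, lambda, alpha) {
  lambda * (alpha * sum(abs(beta)) + (1 - alpha) * sum(beta^2))
}

beta <- c(2, -1, 0.5)
enet_penalty(beta, lambda = 1, alpha = 1)  # pure lasso: sum(|beta|) = 3.5
enet_penalty(beta, lambda = 1, alpha = 0)  # pure ridge: sum(beta^2) = 5.25
```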

Assignment: Ridge Regression and the Lasso

ISLR Lab: 6.5.2 Ridge Regression and the Lasso

Readings (1/2)

Dimension Reduction Methods

Principal Components Regression

Partial Least Squares

Principal Components Regression

Partial Least Squares

Readings (2/2)

Considerations in High Dimensions

High-Dimensional Data

What Goes Wrong in High Dimensions?

Regression in High Dimensions

Interpreting Results in High Dimenions

Moving Beyond Linearity

Polynomial Regression (1/8)

We can extend linear regression to include polynomial terms where the coefficients are still linear, but the predictors are polynomial terms of the original predictors.

\[ y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \ldots + \beta_d x_i^d + \epsilon_i \]

In most cases, the value of \(d\) is small, often 2 or 3, as higher values can lead to overfitting.
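In R there are two common ways to specify polynomial terms: poly(), which uses orthogonal polynomials, and raw powers via I(). They are different bases for the same model, so the fitted values agree; a quick, self-contained check using R’s built-in cars data (the loadcell example follows on the next slides):

```r
# Same quadratic model in two parameterizations
m_orth <- lm(dist ~ poly(speed, 2), data = cars)       # orthogonal polynomial basis
m_raw  <- lm(dist ~ speed + I(speed^2), data = cars)   # raw powers

all.equal(fitted(m_orth), fitted(m_raw))  # TRUE: identical fits, different coefficients
```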

Polynomial Regression (2/8)

Example 1: Load the Loadcell data

Code
loadcell <- read_table("datasets/LOADCELL.DAT", skip = 25, col_names = c("Load Level", "Response"))
loadcell

Polynomial Regression (3/8)

Example 1: EDA of Loadcell data

Code
loadcell |>
  ggplot() +
  geom_point(aes(x = `Load Level`, y = Response)) +
  labs(
    title = "Loadcell Data",
    subtitle = "Data appears to have a linear relationship",
    x = "Load Level",
    y = "Response"
  ) +
  theme_light()

Polynomial Regression (4/8)

Example 1: Simple linear regression on the Loadcell data

Code
loadcell_lm <- lm(Response ~ `Load Level`, data = loadcell)
summary(loadcell_lm)

Call:
lm(formula = Response ~ `Load Level`, data = loadcell)

Residuals:
       Min         1Q     Median         3Q        Max 
-3.181e-04 -2.075e-04 -2.862e-05  1.983e-04  4.208e-04 

Coefficients:
               Estimate Std. Error   t value Pr(>|t|)    
(Intercept)  -7.156e-04  9.025e-05    -7.929 5.97e-09 ***
`Load Level`  1.003e-01  6.725e-06 14909.755  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.000239 on 31 degrees of freedom
Multiple R-squared:      1, Adjusted R-squared:      1 
F-statistic: 2.223e+08 on 1 and 31 DF,  p-value: < 2.2e-16


Note: the R-squared is reported as 1! Even so, the residuals are worth inspecting.

Polynomial Regression (5/8)

Example 1: Inspect the residuals

Code
loadcell_resid <- augment(loadcell_lm)
loadcell_resid |> 
  ggplot() +
  geom_point(aes(x = `Load Level`, y = .resid)) +
  labs(
    title = "Loadcell Residuals from Simple Linear Regression",
    subtitle = "Residuals are not randomly distributed",
    x = "Load Level",
    y = "Residuals"
  ) +
  theme_light()

Polynomial Regression (6/8)

Example 1: Quadratic Regression on Loadcell data

Code
loadcell_lm_quad <- lm(Response ~ poly(`Load Level`, 2), data = loadcell)
summary(loadcell_lm_quad)

Call:
lm(formula = Response ~ poly(`Load Level`, 2), data = loadcell)

Residuals:
       Min         1Q     Median         3Q        Max 
-9.966e-05 -1.466e-05  5.944e-06  2.515e-05  5.595e-05 

Coefficients:
                        Estimate Std. Error   t value Pr(>|t|)    
(Intercept)            1.193e+00  6.552e-06 182130.14   <2e-16 ***
poly(`Load Level`, 2)1 3.563e+00  3.764e-05  94658.88   <2e-16 ***
poly(`Load Level`, 2)2 1.314e-03  3.764e-05     34.92   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.764e-05 on 30 degrees of freedom
Multiple R-squared:      1, Adjusted R-squared:      1 
F-statistic: 4.48e+09 on 2 and 30 DF,  p-value: < 2.2e-16

Polynomial Regression (7/8)

Example 1: Inspect the residuals of the quadratic model

Code
set.seed(42) # added so jitter will not change between renders
loadcell_resid_quad <- augment(loadcell_lm_quad)
loadcell_resid_quad |> 
  ggplot() +
  geom_point(aes(x = loadcell$`Load Level`, y = .resid), 
             position = position_jitter(width = 0.15),
             shape = 1) +
  labs(
    title = "Loadcell Residuals from Quadratic Regression",
    subtitle = "Residuals are randomly distributed",
    x = "Load Level",
    y = "Residuals"
  ) +
  theme_light()

Polynomial Regression (8/8)

Example 1: An alternative method to construct the polynomial model

Code
loadcell_lm_quad2 <- lm(Response ~ `Load Level` + I(`Load Level`^2), data = loadcell)
summary(loadcell_lm_quad2)

Call:
lm(formula = Response ~ `Load Level` + I(`Load Level`^2), data = loadcell)

Residuals:
       Min         1Q     Median         3Q        Max 
-9.966e-05 -1.466e-05  5.944e-06  2.515e-05  5.595e-05 

Coefficients:
                    Estimate Std. Error   t value Pr(>|t|)    
(Intercept)       -1.840e-05  2.451e-05    -0.751    0.459    
`Load Level`       1.001e-01  4.839e-06 20687.891   <2e-16 ***
I(`Load Level`^2)  7.032e-06  2.014e-07    34.922   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.764e-05 on 30 degrees of freedom
Multiple R-squared:      1, Adjusted R-squared:      1 
F-statistic: 4.48e+09 on 2 and 30 DF,  p-value: < 2.2e-16

Assignment: Polynomial Regression

ISLR Lab: 7.8.1 Polynomial Regression

End of Module 9

References